I was a trained classical pianist in my previous professional life. Remember those infomercials claiming that you could learn to play the piano in a flash? This has been an ongoing joke between my husband and me for years. From time to time, he threatens to achieve in a mere four hours what took me years of blood, sweat, and tears!
So far, however, it has remained an empty threat (thankfully). I think most reasonable people understand these programs do not turn a complete newbie into a professional pianist in only a handful of hours.
I have always strongly encouraged people to learn and enjoy playing the piano. It fosters a deeper appreciation for the work of other musicians and perhaps even opens the door to collaboration under the right circumstances. However, offering the skill as a professional service for a fee is an entirely different matter. It would be irresponsible for me to encourage that in someone who has only cursory training.
Can You Learn Data Science in a Flash?
The same is true of data science, or anything else for that matter. The COVID pandemic accelerated digitization into a minimum necessity for staying competitive, or even surviving. We should encourage others to learn about data so they can appreciate it and reason about it more intelligently. That is essential today.
However, the line between intelligent appreciation and hard skills is becoming increasingly blurry. Too many unsolicited “Learn Data Science in X Days/Weeks” ads continue to show up in my feeds. Together with the popularity of analytics “democratization,” they threaten the integrity and the long-term well-being of advanced analytics. Not to mention the poor targeting: why serve these ads to me, a 20-year veteran of analytics?
Curiously, within data science there is everything from tacit acceptance to enthusiastic support for this. I do not believe these short programs aim to convert a complete novice into a fully competent data scientist, but such expectations are rarely articulated clearly. The notion plays very well into a rapid-results culture that favors shortcuts.
I recognize that not everyone in these programs is starting from scratch. Those with adjacent backgrounds and only a few missing skills have a much easier transition into this coveted discipline. Obviously, there are other factors as well.
We can question what one means by a “data scientist,” but that is for another time. It suffices to say that what a business needs from a “data scientist” runs a wide gamut. Not all of those needs are about specific algorithms, programming languages, or even domain knowledge. Instead, I expect the following two technical competencies from any “data scientist” at a minimum.
#1: Solid grasp of probability basics.
Regardless of the method employed, statistical or otherwise, any data analysis design is heavily dependent on probability concepts. This does not necessarily require deep expertise in probability theory. However, you do need a strong grasp of probability basics and their implications.
I recently came across a post comparing and contrasting statistics and machine learning. It stated flat out that probability was not critical in machine learning. This is problematic, because one of the major sources of bias is a poor grasp of probability concepts and their implications. Bias in analytics is a hot topic today, especially where it intersects with ethics. We may understand bias notionally, but data scientists need to translate that understanding into what to do with the data.
This is a critical point. Data scientists generally work with data captured not for the analysis at hand but for other operational (including transactional) or administrative purposes. All analyses conducted using these data sources are secondary uses, what statisticians call “observational studies.”
It is not always evident that we make assumptions, some explicit and many implicit, for data analysis to be valid. Primary studies use data specifically collected for them, which helps reduce the number of assumptions. Good analysts are explicit about all assumptions, statistical as well as non-statistical, with justifications for each assumption.
Users of secondary data often do not recognize or articulate enough of these assumptions, which increases the risk carried by the analysis results. The assumptions we do not even know we are making are often the riskiest. Taking a sample of the secondary data does not address this problem from a probability perspective. It is up to the data scientist to reconcile the purpose of the analysis with the purpose for which the data were originally accumulated, to understand the impact of the differences, and to reflect this in the analysis.
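To make that point concrete, here is a minimal sketch in Python; the scenario and all of the numbers are hypothetical, invented only for illustration. It simulates a secondary dataset whose capture process over-represents one group, then shows that taking a random sample of that data leaves the distortion fully intact.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 100,000 customers, of whom 30% would respond to an offer.
responder = rng.random(100_000) < 0.30

# Secondary data: records captured for some other operational purpose that
# happens to over-represent responders (e.g., customers who contacted support).
capture_prob = np.where(responder, 0.8, 0.2)
captured = rng.random(100_000) < capture_prob
secondary = responder[captured]

# A random sample of the secondary data inherits the same selection bias.
sample = rng.choice(secondary, size=5_000, replace=False)

print(f"True response rate:            {responder.mean():.1%}")
print(f"Rate in the secondary data:    {secondary.mean():.1%}")
print(f"Rate in a sample of that data: {sample.mean():.1%}")
```

No amount of sampling reconciles why the data were captured with what the analysis is trying to answer; only the analyst can do that, and only by being explicit about it.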
#2: Ability to logic your way through complex, previously unseen problems using programming, whatever the language.
A data scientist needs to be comfortable getting around very messy raw data, big or small. It is the science of data, after all. Setting aside obvious data management issues, such as missing basic documentation and technical errors, I disagree with the idea that data scientists spend too much time wrangling data. We often discover the implications of how the data came to be through wrangling. It is not unlike getting to know your future spouse by dating before getting married.
This is where being a good programmer is important. The specific language is secondary, as long as you understand its strengths and weaknesses. You can always learn another one if need be. Certain languages are indeed better for certain things than others, often by design. You still must be able to logic your way through a messy pile of data with efficient and well-structured code.
Programming helps develop your ability to structure complex logic. This is an extensible and critical skill, because today’s world is full of problems that are not well defined. Data scientists regularly see data they have never seen before, and without the ability to logic your way through data, you are of limited utility outside of well-defined problems. It is relatively easy to spot well-structured logic versus the coding equivalent of a stream of consciousness, as the sketch below illustrates.
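Here is a small, entirely hypothetical wrangling sketch in Python; the columns and cleaning rules are made up for the example. Each step is a named function that states the assumption it relies on, which is the kind of structure that separates deliberate logic from a stream of consciousness.

```python
import pandas as pd

# Hypothetical messy extract: free-text amounts, inconsistent country codes,
# and duplicate records for the same customer.
raw = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "amount":      ["1,200.50", "1,200.50", "N/A", "350"],
    "country":     ["US", "usa", "U.S.", "US"],
})

def parse_amount(series: pd.Series) -> pd.Series:
    """Strip thousands separators; coerce anything unparseable to NaN."""
    return pd.to_numeric(series.str.replace(",", "", regex=False), errors="coerce")

def normalize_country(series: pd.Series) -> pd.Series:
    """Collapse the spelling variants we know about into a single code."""
    return series.str.upper().str.replace(".", "", regex=False).replace({"USA": "US"})

def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per customer; assumes later records supersede earlier ones."""
    return df.drop_duplicates(subset="customer_id", keep="last")

clean = (
    raw.assign(amount=lambda d: parse_amount(d["amount"]),
               country=lambda d: normalize_country(d["country"]))
       .pipe(deduplicate)
)
print(clean)
```

Every rule lives in one place, can be tested on its own, and can be revisited when the next, different pile of messy data shows up.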
Many misguided blogs have implied or even explicitly stated that data scientists have to program in Python. I once coded something entirely in Base SAS just to prove a point, and because I could. Your tool may not be the best for the task at hand, but there are often valid reasons to use it over another. Too often, we get stuck in the tool and forget why we have a tool in the first place.
What about algorithms and techniques? What about knowledge?
I purposely left algorithms and techniques out of the criteria. This is where short courses are perfectly suited. You can always pick up methods and techniques quickly through whatever avenue suits you. But that is provided you have a solid foundation, and building that foundation is not measured in weeks if you are just getting started.
It is not necessarily about getting a degree from a certain institution or about completing a specific training program, either. Reputable and/or accredited programs obviously pose less risk for the student. However, it is about what the student really comes to understand at the core of the acquired knowledge. While real-life experience helps mature that understanding, and you as a professional in general, it often gets too much credit when evaluating job candidates. One of the most egregious examples is a job posting, recently pointed out by someone, that required more years of experience than the technology in question has existed!
As a hiring manager, I never asked detailed technical questions in candidate interviews or gave exams. I was more interested in what the candidate understood than in what the candidate could recite within the confines of a well-defined problem. Data scientists need to solve problems they have never seen before, do so effectively, and be passionate about these fundamental concepts. I rarely had issues with the hires I made, and I had some great teams!
Can you learn data science in a flash? Like you can learn to play the piano in a flash.